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Ready access to emerging databases of gene annotation and func- 
tional pathways has shifted assessments of differential expression in 
DNA microarray studies from single genes to groups of genes with 
shared biological function. This paper takes a critical look at exist- 
ing methods for assessing the differential expression of a group of 
genes (functional category), and provides some suggestions for im- 
proved performance. We begin by presenting a general framework, in 
which the set of genes in a functional category is compared to the 
complementary set of genes on the array. The framework includes 
tests for overrepresentation of a category within a list of significant 
genes, and methods that consider continuous measures of differential 
expression. Existing tests are divided into two classes. Class 1 tests 
assume gene-specific measures of differential expression are indepen- 
dent, despite overwhelming evidence of positive correlation. Analytic 
and simulated results are presented that demonstrate Class 1 tests are 
strongly anti-conservative in practice. Class 2 tests account for gene 
correlation, typically through array permutation that by construction 
has proper Type I error control for the induced null. However, both 
Class 1 and Class 2 tests use a null hypothesis that all genes have the 
same degree of differential expression. We introduce a more sensible 
and general (Class 3) null under which the profile of differential ex- 
pression is the same within the category and complement. Under this 
broader null, Class 2 tests are shown to be conservative. We propose 
standard bootstrap methods for testing against the Class 3 null and 
demonstrate they provide valid Type I error control and more power 
than array permutation in simulated datasets and real microarray 
experiments. 

1. Introduction. DNA microarrays allow researchers to simultaneously 
measure the coexpression of thousands of genes. They are widely used in 
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biology and medicine to study the relationships between transcriptional ex- 
pression and cellular processes or disease states. A primary application of 
microarrays is the identification of genes with differing expression across ex- 
perimental conditions, or having significant association with a clinical out- 
come. Hereafter we will generically refer to the condition or clinical outcome 
as the response for each array, and the association between expression and 
response as differential expression (DE). 

Analyses of DE often proceed in a gene-by-gene manner, in which the asso- 
ciation between the response and the expression of each gene is 
assessed individually. A variety of methods have been proposed, includ- 
ing standard parametric tests, permutation and resampling-based meth- 
ods, and Bayesian techniques [Dudoit et al. (2002), Newton et al. (2004) and 
Tusher, Tibshirani and Chu (2001)]. Using these methods, investigators can 
produce a ranked list of genes significantly associated to the response that 
may account for multiple testing through control of the family-wise error 
rate (FWER) or false discovery rate (FDR). 

Although gene-specific analyses have yielded tremendous insight into the 
role of individual genes, they do not provide a mechanism for identifying 
larger-scale biological phenomena. With the ready availability of compre- 
hensive annotation databases such as Gene Ontology (GO) [Ashburner et al. 
(2000)], researchers can now explore the coordinated involvement of gene cat- 
egories, namely, sets of genes with shared annotation or function. A general 
framework is warranted for evaluating methods that test the associations of 
an entire category to the response of interest, and will allow a more system- 
atic understanding of DE across the genome. 

Beginning with Virtaneva et al. (2001), a number of procedures have been 
presented as ways to assess the association between a response and the ex- 
pression of a gene category. The most commonly used tests begin with a 
list of genes deemed significant and look for over-representation of category 
members within the gene-list, using Fisher's Exact Test or other tests of as- 
sociation for 2 x 2 contingency tables [see Barry, Nobel and Wright (2005) 
for a list of references]. Other approaches more directly use the gene-specific 
measures of DE, rather than collapsing the data to the dichotomous outcome 
of significant association with the response. In these methods tests are con- 
structed to compare the association of genes using the average differences of 
gene-specific statistics [Kim and Volsky (2005) and Boorsma et al. (2005)], 
or rank-based procedures for two-sample comparisons [Mootha et al. (2003), 
Barry, Nobel and Wright (2005) and Ben-shaul, Bergman and Soreq (2005)]. 

Gene category testing is now widely performed, and results are frequently 
reported without independent verification. As pointed out in a recent review 
by Allison, Cui, Page et al. (2006), even fundamental issues such as the for- 
mal definition of the underlying null hypothesis and a proper analysis of 
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Type I error have not been provided for many of the methods in the litera- 
ture. Recent work by Goeman and Buhlmann (2007) addresses some of the 
issues surrounding the assumptions of independence in gene category testing, 
while suggesting that commonly used methods fail to test for a more direct 
hypothesis of DE among the category members. Dudoit et al. (2007) have 
also described a very general framework for hypothesis testing that captures 
most existing methods for testing gene categories and proposes bootstrap- 
based testing. However, there remains a clear need to differentiate among 
existing methods in order to examine their strengths and potential deficien- 
cies, and to place gene category testing on a firm statistical foundation. 

1.1. Contributions. In this paper we provide a careful, extended exam- 
ination of gene category testing and discuss how a standard application 
of bootstrap methodology offers improved performance and flexibility over 
some of the existing methods in the literature. We begin by defining a frame- 
work for gene category testing, which is general enough to include the major- 
ity of the existing methods in the literature. Within our framework, existing 
gene category methods can be divided into two distinct classes of procedures 
as defined by the following null hypotheses: 

1. Gene-specific statistics are independent and identically distributed; 

2. Gene-specific statistics follow a common null distribution, though they 
may be dependent. 

Several shortcomings of these null hypotheses are demonstrated through 
analytic derivations and simulations using an example dataset. 

We propose a broader null hypothesis that allows for arbitrary dependence 
between the expression of different genes, as well as varying degrees of as- 
sociation between the expression of a given gene and the response. Under 
this more general null, array permutation approaches can be quite conser- 
vative. The conservativeness can be explained in part through an analytical 
argument which shows that the maximum variance of the category-wide 
test statistic occurs under the special case induced by array permutation. 
To remedy this problem, we suggest a simple bootstrap-based test that is 
consistent with the general null hypothesis. We demonstrate the utility of 
the bootstrap test on a breast cancer dataset, and discuss other advantages 
that bootstrap-based tests have over array permutation procedures. 

2. Notation and general framework for gene category tests. Let x be an 

mxn matrix containing the observed expression data for an experiment with 
m genes and n arrays. Let Xij be the element of the matrix corresponding 
to the ith gene in the jth array. The expression profile for gene i is the 
row vector x«, and the expression values of array j are represented by the 
column vector x*,-. We use lowercase letters to denote observed values, and 
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uppercase (i.e., X, Xy, Xj*, and X*j) to denote random versions of these 
quantities. The array-specific response information is denoted by y, with 
element yj corresponding to array j. The response can be categorical (e.g., 
tumor grade or experimental group assignment) or continuous (e.g., survival 
time), and could potentially be multivariate. A category is represented by a 
subset C C {1, . . . , m} such that i £ C if and only if gene i is a member of the 
category. The size of a category C will be denoted by mp = Ya^Li € C}. 
For any category C, its complement will be denoted by C, and is of size 
rriQ = m — mc- 

We adopt the terminology of Barry, Nobel and Wright (2005), where it 
is noted that hypothesis tests of gene categories can be viewed as two-stage 
procedures (see Box 1). In the first stage, a local statistic measures the 
association between the expression profile of each gene and the response. 
We denote the local statistic of gene i by = T(Xj*,y) and let ij be the 
corresponding value based on observed data. In a two-condition experiment, 
the local statistic might be a t-statistic or average fold change. For more 
complex datasets, such as those with censored survival data, a local statistic 
derived from a Cox proportional hazard model may be used to test for 
association between gene expression and patient outcome. In many cases, 
T is an estimate of an underlying gene-specific parameter that governs the 
association between response and expression. In the two-condition example 
above, the related parameters would be a scaled difference of means and 
a ratio of population means, respectively. Properties of local statistics are 
examined more fully in Section 5.3. 

In the second stage of a gene category test, a global statistic exam- 
ines the differential expression within the gene category through the col- 
lection of local statistics. The global statistic can be generally denoted by 
U = U(T\, . . . ,T m :C), and in the following sections we describe many of 
the functional forms of U{-) that have been utilized in the literature. In 
the most commonly employed tests of gene categories, U(-) compares the 
local statistics of genes within a category C to those in its complement. 
Methods focus on either detecting a difference in the proportion of genes 
with significant DE, or determining a shift in the average local statistic of 
the category against its complement. Goeman and Buhlmann (2007) have 
argued that comparing a category to its complement creates an unnecessary 
conflict between these methods and the gene-specific tests. However, alter- 
natively proposed methods that directly test the DE within a category have 
their own drawbacks. For many direct tests, the null hypothesis will tend to 
be rejected more often for large categories than small categories. This will be 
true even if the genes in the category are chosen at random. For this reason, 
we will limit our focus to tests that compare a category to its complement. 

Among the current methods for analyzing gene categories, there are var- 
ious ways to classify the tests that have been proposed (e.g., by the choice 
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of global statistic). In terms of Type I error control, we argue that the more 
meaningful distinction is based on the implicit null hypotheses, as described 
in the following sections. Most existing procedures can be roughly divided 
according to whether array permutation is used, but we note that additional 
requirements must be placed on the local statistics in order for the inference 
to be sensible. 

Throughout our paper we treat the categories to which a gene belongs as 
a fixed property of the gene. 

Box 1: Common elements of gene category tests 

Gene category tests are typically two-stage procedures requiring 
the following statistics: 

• A local statistic that measures the association between 
the response (e.g., experimental condition) and the ex- 
pression of each gene. 

• A global statistic that examines the local statistics 
within a category, often in comparison to those of its 
complement. 

For each global statistic there are two broad classes of hypoth- 
esis tests: 

1. Parametric or rank-based procedures that assume inde- 
pendent and identically distributed local statistics, or 
alternatively, gene permutation methods that induce 
the same approximate null. 

2. Array permutation methods which maintain the corre- 
lation structure while inducing a null of no associations 
with the response. 

Error rate controlling or estimating procedures address the mul- 
tiple comparisons from testing many categories. 



3. Class 1 gene category tests. Global test statistics detect an increased 
level of DE among the genes within a category. Many testing procedures 
use traditional methods for comparing independent samples from two pop- 
ulations. In the proposals for these methods, the null hypothesis is rarely 
stated, and without discussion of the appropriateness of the underlying as- 
sumptions. While a variety of global statistics have been employed in these 
tests, and p- values are obtained from both exact and approximate distribu- 
tional assumptions, we note the null hypotheses have a common form. 

Definition 1. A gene category test is of Class 1 if it assumes (or in- 
duces through gene permutation) the null hypothesis that the local statistics 
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Ti, . . . ,T m are independent and identically distributed (i.i.d.), namely, 
(3.1) Hq : Ti,T 2 , . . . ,T m are i.i.d. with Ti ~ F, 

where F can take any form. 

3.1. A survey of global test statistics. The global statistics proposed for 
Class 1 tests fall into two groups. "Categorical" statistics rely on a list of 
significant genes to have been identified by a prior gene-specific analysis, 
while "continuous" global statistics incorporate real- valued measures of DE 
for each gene, without reference to a list of significant genes. To illustrate 
the variety of global statistics that have been proposed in the literature, we 
present two examples from each group and give a brief description of the 
corresponding nonresampling based Class 1 tests. A one-sided form of each 
test is given, because in most applications one is only interested in categories 
showing more association with the response than their complements. We 
note it is conceivable to conduct a one-sided test in the opposite direction; 
for instance, one could look for relative stability within a set of housekeeping 
genes. 

Categorical statistics. Gene-list enrichment methods have been developed 
as a post hoc means of testing a category once the genes with signifi- 
cant DE have been identified. Let V denote the rejection region for local 
statistics that produces the list of significant genes. Categorical methods 
consider only the dichotomous outcomes of the m gene-specific hypothe- 
sis tests, and the extent of DE within C and C can therefore be summa- 
rized by a 2 x 2 contingency table [illustrated in Supplementary Figure 1, 
Barry, Nobel and Wright (2008)]. 

Traditional tests for contingency tables have been utilized in various gene 
category analyses, including the y 2 test of homogeneity, Fisher's Exact test, 
and slight variations on these tests for contingency tables. In the classical 
derivation of such tests, binary variables I{T\ € T}, . . . ,I{T m G T} are as- 
sumed to be independent with probabilities of rejection P(Ti G T) = ire for 
i € C and P(Ti £ T) = -kq for % EC. The tests are designed to have power 
to detect departures from (3.1) of the form -kq > un der the assumption 
that the indicator variables are i.i.d. It is worthwhile to note that the Class 1 
null is sufficient, but not necessary, for the dichotomous outcomes to be i.i.d. 
under a given T. However, (3.1) guarantees the categorical null holds for any 
possible choice of rejection region. We also note that T may at times be de- 
fined in a data-dependent manner, such as when using an error controlling 
procedure in defining the significant gene list. This violates the assumption 
of independent test results, even if expression is uncorrelated between genes. 

The most common test in gene-list enrichment methods is Fisher's Exact 
Test. Formally, this is a conditional test based on the total number of rejected 
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hypotheses, R = YHL\I{Ti € T}. The global statistic can be represented as 
the number of genes in the category that are rejected, namely, 

(3.2) u F = j2nT t ^r}. 

Given R, an exact p-value can be obtained from the hyper geometric distri- 
bution. 

In several gene- list enrichment software packages, the unconditional % 2 
test of homogeneity is proposed as an approximate test for large categories 
[Draghici et al. (2003) and Beifibarth and Speed (2004)]. The one-sided ver- 
sion of this test is equivalent to the difference in proportions test originally 
proposed by Pearson (1911). The associated global statistic can be written 
in the form 

(3.3) u P = ^^ = ^—J2 I {T l er} L_ ^ E r}, 

°P rn c ■ op f^ c m a ■ a P 

where ap is the traditional estimated standard deviation of the difference in 
proportions. Under the Class 1 null, the central limit theorem ensures that 
the two proportions are asymptotically normal for large mc and thq, such 
than a Z-test can be performed on Up. 

Given the variety of methods for generating gene-lists, it is not always 
clear whether it is appropriate to condition on R, but in general, exact tests 
are favored for their ability to handle small categories. For moderately sized 
categories, we note there will be little difference between the exact condi- 
tional and the following approximate unconditional test. For this reason, we 
will restrict our attention to Uf in the simulations performed in Section 4. 

Continuous statistics. In contrast to gene-list type tests, it is also possible 
to directly compare the observed associations of expression and response 
without an intermediate gene list. One straightforward global statistic is the 
average difference in local statistics between a category and complement, 
namely, 

(3.4) u D = = _i_5r;r i £ r„ 

a D m c ■ a D m e ■ a D 

which has power to detect an increase in the expected value of local statistics 
in the category, fie = E[Ti\i € C], relative to the complement, [Iq = E[Ti\i G 
C] [as illustrated in Supplementary Figure 1, Barry, Nobel and Wright (2008)] 
Several hypothesis tests based on the average difference have been proposed, 
including a Z-test performed where dp, is the standard deviation of all m 
local statistics [Kim and Volsky (2005)], and a t-test performed where ap, 
is the pooled sample variance of the local statistics [Boorsma et al. (2005)]. 
In the remainder of this paper we will focus on the i-test version of Up, but 
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note for a typical category where mc <C m, the variance estimates in the 
two approaches will be similar, yielding comparable results. 

The global statistic in (3.4) may not be robust to outliers or skewness in 
the local statistics. Rank-based global statistics avoid this shortcoming, as 
they are invariant to monotone transformations of the local statistics. The 
Wilcoxon rank sum statistic, 



is designed to test a median difference in the two populations of local statis- 
tics and has been implemented in GOStat by BeiBbarth and Speed (2004). 
Under the Class 1 null hypothesis, the discrete CDF of Uw is known once 
mc and m c are specified. Hypothesis testing then proceeds using an exact 
procedure or a normal approximation to Uw- 

A Kolmogorov-Smirnov type global statistic has also been implemented 
in another rank-based Class 1 procedure [Ben-shaul, Bergman and Soreq 
(2005)]. However, the Kolmogorov-Smirnov statistic has been criticized in 
gene category testing for being sensitive to departures that do not necessar- 
ily reflect increasing DE in the category [Damian and Gorfine (2004)]. For 
example, a category with no DE but with local statistics that all happen to 
be nearly identical would be considered significant by these tests. For this 
reason, we restrict our focus to XJjy and Uw when considering continuous 
global statistics. 

3.2. Gene permutation. Several permutation-based methods have pro- 
posed randomly reordering the rows of the data matrix to determine category 
significance [Ashburner et al. (2000), Pavlidis et al. (2004) and Zhong et al. 
(2004)]. In this setup, the collection of local statistics remains unchanged 
while the category assignments are randomized. Gene permutation effec- 
tively induces the Class 1 null hypothesis in (3.1), with the distribution 
of each reassigned local statistic equaling the empirical distribution of all 
observed values, F(t) = m~ l I{U < t}. Exhaustive permutation of the 
gene assignments will be identical to a Fisher's Exact Test of Up and a 
Wilcoxon rank sum test of Uw- Although gene permutation has limited use- 
fulness for global statistics with traditional tests for the null stated in (3.1), it 
has proven to be useful in more complex global statistics [Efron and Tibshirani 
(2007)], and also maintains the correlation between tests of overlapping cat- 
egories, despite inducing independent local statistics. 

We emphasize that gene-permutation procedures (which are also called 
"gene-shuffling") follow a reasonable and principled development, if one is 
willing to assume the category assignments in C are random. Then the null 
hypothesis is that C and the expression data X are independent. Under 
these assumptions, gene permutation reflects inference conditioned on the 
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expression data. However, as we detail later, Class 1 tests, including those 
based on gene permutation, are sensitive to correlation of expression of genes 
within categories, regardless of DE. Such correlation represents a departure 
from the assumption of independence of X and C, but is unrelated to DE. 
Following our perspective that gene category assignments are fixed, we view 
gene permutation procedures as Class 1. 

4. The effect of correlation on Class 1 tests. In this section we examine 
more closely the assumption of independent local statistics, and how vio- 
lations of this assumption effect the performance of Class 1 tests. We note 
that correlation of local statistics arises naturally from correlation of ex- 
pression among genes. A simulation study based on a real microarray data 
set exhibits the extreme anti-conservative behavior of Class 1 tests in the 
presence of realistic levels of correlation in expression. 

4.1. Correlations in expression and local statistics. Let the population 
correlation between genes i and i! be given as pf it = Corr(Xy,.Xj/j). For 
experimental designs with independent arrays, a natural estimate of pf v is 
the sample correlation coefficient, ra>. The distributions of global statistics 
for Class 1 tests are directly affected by the correlation between local statis- 
tics, pf i, = Corr(Tj, Tj/). In the special case that T takes the linear form 
T(Xj*,y) =J2 1 j=i a (yj) ' for some function a(-), it is easy to see that 
PiV = PiV ■ A n example of a linear local statistic would be fold change on 
the log-scale. 

In general, the relationship between pf it and pj it does not have a simple 
analytic form, although it can be shown numerically to often be monotone 
and approximately linear for one-sided local statistics. Indeed, Monte Carlo 
simulations of gene expression data (Figure 1) demonstrate that a nearly 
linear relationship holds for several standard experimental designs and cor- 
responding measures of DE. This includes using a Student's t as the local 
statistic for a two-condition study, and a Wald-type statistic for regress- 
ing expression on censored time-to-event data through a Cox proportional 
hazards model. For such local statistics, pf v ~ pf it so that the sample cor- 
relation coefficients of gene expression can be used as estimates of {pf^} 
in determining the properties of global statistics. However, the two correla- 
tions have a nonlinear relationship for "undirected" local statistics, such as 
an analysis of variance F-statistic [Figure 1(b)]. 

4.2. Variance inflation. The effects of pairwise correlation on Class 1 
tests can be illustrated by deriving the true variances of the global statistics 
Ud and Uyy in the presence of dependence. 
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Student's T-statistic 



ANOVA F-statistic 



CoxPH Wald statistic 





-1.0 -0.5 0.0 0.5 1,0 
Correlation in Expression 



-1.0 -0.5 0.0 0.5 1,0 
Correlation in Expression 




(a) 



(b) 



-1,0 -0.5 0.0 0.5 1,0 
Correlation in Expression 

(c) 



Fig. 1. Correlations between gene expression levels induce correlations in local statis- 
tics. Monte Carlo simulations of standard Gaussian expression for two genes under sev- 
eral experimental designs: (a) a two-sample comparison with a Student's t-statistic; (b) 
four-sample comparison with an ANOVA F-statistic; (c) survival using a Wald test from 
a univariate Cox-proportional hazard model. For each design, sample correlation was ob- 
served across 100 gene pairs and n = 40 arrays with equally sized groups or exponentially 
distributed event and censor times. The median and representative (0.05, 0.95) quantile 
intervals are shown from 200 simulations. Similar results are obtained when simulating 
heteroscedastic genes. 



For the average difference global statistic Up, a simple calculation shows 
that the true variance of the statistic will differ from that under the i.i.d. 
null in Class 1 tests by three additional terms: 

Var[/i c -p c ] 

(4.1) = Vari.i.d.[Ac- Ac] 

/ m c (m c - 1) , m c {m c - 1) m c ■ m c 

x 1 H pc H p c p c c 

\ m m m 

where the quantities 

( 4 -2) pc = — r — ttEE^ 

(43) PC= m*-(L-l) ^ y £<&'' 



i<£Ci'£C 



( 4 - 4 ) pc,c = — - — E E p\v 

are related to the average pairwise correlations within the category (4.2), 
within its complement (4.3), and across the two gene sets (4.4). We note 
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that pc can vary greatly across categories, while p c and Pec w * n be close 
to the average pairwise correlation of all genes on the array and near zero in 
most datasets. For a moderately sized category where uiq ~ m, the ration 
of variances in (4.1), Var[Ac — Pc]/^ ai i.i.d. [Ac — Acl> * s approximately 1 + 
(mc — 1) • pc- This ratio measures the variance inflation of Up, over what 
is assumed by (3.1), and as a consequence, the category exhibiting positive 
correlation will tend to have anti-conservative Class 1 tests of significance 
DE. 

For the Wilcoxon rank sum global statistic, the true variance will depend 
on the common distribution F of local statistics, as defined in (3.1). In 
the special case that local statistics are marginally normally distributed, 
with common mean, unit variances, and pairwise correlations {pf^}, then 
Var[£/w] is given by 

(4.5) Var[[%] ^EEE E ^ ( ^' + ^> ~ $ * ~ ^ ) . 

The derivation of (4.5) is provided in the Supplementary material 
[Barry, Nobel and Wright (2008)] and is analogous to the classic work of 
Gastwirth and Rubin (1971) on the effect of dependence on the Wilcoxon 
rank sum. If the local statistics within a category were all positively corre- 
lated and the complementary set of genes were independent, this variance 
is easily shown to be strictly greater than what is assumed under the Class 
1 null, Varied, [t^y] = m c ' m c ' {m + \)/12. 

Correlation between local statistics will also affect the distributions of 
Up and Up. However, the variance of these categorical global statistics 
in the presence of correlation will further depend on both the distribu- 
tion F and the rejection region V. In the next subsection we present a 
simple simulation study illustrating the effect of gene correlation on Class 
1 tests for annotated categories in a real microarray dataset. The anti- 
conservative behavior of a Class 1 test using the global statistic Up is 
explored in the recent work of Goeman and Buhlmann (2007), who con- 
sidered simulated Gaussian expression data in which the pairwise correla- 
tion of genes is fixed and equal, and categories are of a fixed size. The 
simulation study below attempts to capture the more complicated correla- 
tion structures and variable sizes of functional categories that occur in real 
data. 



4.3. A simulation study. A two-condition experiment was simulated us- 
ing a subset of the lung carcinoma microarrays from Bhattacharjee et al. 
(2001). We first selected 100 adenocarcinoma samples at random from the 
dataset, that contains expression estimates for 7299 genes [see 
Barry, Nobel and Wright (2005) for data pre-processing steps]. Among the 
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P-values loglO(P-values) 

(a) (b) 

Fig. 2. Poor performance of Class 1 tests, (a) Empirical CDFs of pooled p-values (1823 
categories over 1000 simulations) for the Class 1 Fisher's Exact Test, Student's t-test, 
and Wilcoxon rank sum test generated under the null hypothesis in (5.1). (b) The same 
data are plotted on the log-10 scale to demonstrate the disproportionate number of small 
p-values that results from incorrect variance estimates. 



available genes, 1823 GO and Pfam categories were identified with at least 5 
members. The within-category average pairwise sample correlations ranged 
from —0.09 to 0.93, with more than 86% of the categories having values 
greater than the average pairwise correlation across the entire array (0.012). 
This increase in correlation within categories illustrates the common ob- 
servation that coexpression among genes is associated with gene function 
[Lee et al. (2004)]. 

One thousand binary response vectors with equal numbers of zeros and 
ones were generated at random, and used to assign the arrays to one of 
two conditions. The random assignment of arrays ensured that the resulting 
datasets had no association between expression and experimental condition 
for any gene, and thus, no category is expected to have a greater degree of DE 
than any other. We note that the expression matrix is held constant across 
simulations, so that the sample gene-gene correlations remained fixed. 

For each realization of the response vector, a pooled-variance t-statistic 
was used as the local statistic, and global statistics Uf, Ujy and Uw were 
computed. For the Fisher's Exact Test statistic, Uf, the rejection region was 
equal to values exceeding £98,0.95 = 1-66. For each global statistic and each 
category, the different Class 1 tests produced a nominal p-value for every 
generated response vector. Empirical CDFs of the nominal p-values pooled 
across all categories and all realizations demonstrate their extreme nonuni- 
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formity under the induced null hypothesis, confirming the poor performance 
of Class 1 tests (Figure 2). 

The average Type I error of these tests was estimated as the proportion of 
p-values under simulations that fall below a target a level. Table 1 displays 
that, for each global statistic, the corresponding Class 1 tests reject far 
more hypotheses than expected. Moreover, the anti-conservative behavior of 
these tests increases for smaller target a values. Even though the gene-list 
enrichment methods are slightly less anti-conservative than the continuous 
methods, this is offset by their potential loss in power from dichotomizing 
local statistics. 

To illustrate how this behavior also affects the family-wise error rate 
among the L = 1823 categories, we applied a Bonferroni correction to the 
nominal p- values. Since for the randomized data all categories are truly null, 
the FWER is estimated by 



where p^^ is the Class 1 p-value for category h under realization b. There 
is substantial overlap in the membership of gene categories from annota- 
tions such as Gene Ontology, and thus tests will be positively correlated 
accordingly. Therefore, the use of Bonferroni thresholds might be thought 
to be overly stringent in controlling the FWER, providing some protection 
against anti-conservative Class 1 p-values. However, for a = 0.05, the real- 
ized FWER in (4.6) is far greater than the target level (U F ■ 0.776, U D : 0.925 
and Uw '■ 0.918). The extreme anti-conservative behavior of the Class 1 tests 
of all global statistics suggests a different approach is needed to conduct 
valid gene category tests. 

5. Class 2 tests and array permutation. The null hypothesis of Class 1 
tests is violated by the correlations present in gene expression data, and 
we demonstrate the resulting anti-conservative behavior. For this reason, a 
second class of gene category tests is warranted that can identify increases 
in DE within a category, while properly accounting for correlation. 

Definition 2. Class 2 gene category tests are defined by the assumed 
or induced null hypothesis that the local statistics T\,...,T m are possibly 
dependent and identically distributed. More precisely, 

(5.1) Hq : T\, T2, . . . , T m are identically distributed with Tj ~ Fo, 

where -Fo corresponds to a lack of association between expression and the 
response of interest. No assumptions are made about dependence among 
the TjS. 



(4.6) 
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In order for (5.1) to hold in a given experimental design, an appropriate 
form of T(-) must be selected to ensure local statistics are marginally iden- 
tically distributed. In the following section we describe a sufficient property 
of local statistics to induce (5.1) under a global null of all genes having no 
DE. 

5.1. 5 -determined local statistics. In gene category testing, the true as- 
sociation between the expression of an individual gene and a real- or vector- 
valued response can often be summarized by a single fixed parameter 5 
that depends on their (unknown) joint distribution. In the absence of any 
association, the parameter frequently assumes a known null-value. Accord- 
ingly, the local statistic T(-) is chosen for its utility in conducting hypothesis 
tests against the null value of the parameter. To illustrate, consider a two- 
condition experiment where the response vector y takes values of 1 or 2, 
indicating the sample condition of each array. If the expression of gene i has 
expectation fin and fi2i under the two respective conditions, and common 
variance af, then a natural measure of association is the scaled difference in 
means 



Oi ■ y/l/ni + l/n 2 

Here and below, Si denotes the value of the association parameter for gene i. 
In this case, the gene-specific null hypothesis of interest is Hq^ : Si = 0, and 
the pooled- variance t-statistic is a natural choice of local statistic [Galitski et al. 
(1999)]. When the expression levels of gene i are normally distributed, 
and independent from sample to sample, the local statistic follows a t- 
distribution with noncentrality parameter Si, which reduces to a central 
t-distribution when Si = 0. 

In general, a function T(-) is a proper choice of test statistic for a null 
of the form Hq:S = d when the distribution F(T\Si = d) of T given S = 
d is known and does not depend on any nuisance parameters. When the 
distribution of T(-) can be specified in this manner for any choice of d, we 
refer to its distribution as being S- determined. This property is important in 
the basic theory of interval estimation and pivotal quantities; if F(T\S = d) is 
5-determined, it can be used as a pivotal quantity to construct a confidence 
set for S [Casella and Berger (2002)]. In the particular example presented 
above, a Student's t is <5-determined by (5.2). 

The (^-determined property is important when conducting gene category 
tests; if the distribution of T is 5-determined, then differences in nuisance 
parameters do not influence the comparison of a category against its com- 
plement. To illustrate, consider a two-condition experiment with S defined 
as (5.2). Here the gene-specific means and variances of expression are con- 
sidered nuisance parameters. Suppose that for each gene one directly uses 
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the modified ^-statistic from the SAM software [Tusher, Tibshirani and Chu 
(2001)] as the local statistic. This statistic contains a constant in the denom- 
inator that effectively penalizes lowly-expressed genes in order to improve 
the FDR of a gene- list. Due to the presence of this constant, the SAM t- 
statistic is not ^-determined, as its distribution depends on the means and 
variances of gene expression. Now consider a category consisting primarily of 
highly-expressed housekeeping genes. Under a global null in which no genes 
are differentially expressed, and thus no category should be considered sig- 
nificant, highly expressed genes have an increased chance of being ranked 
above lowly-expressed genes when using the SAM statistic. Thus, a category 
with highly expressed genes is more likely to be (falsely) declared significant, 
regardless of whether one uses a categorical or continuous global statistic. 
Categories with lowly-expressed genes would experience the opposite effect. 

When a ^-determined local statistic is chosen and a unique value, do, of 
the parameter corresponds to a lack of association between expression and 
response, the Class 2 null can be restated as Hq : S± = ■ ■ ■ = S m = do. For 
the remainder of the paper, we will only consider local statistics that are 
(^-determined, or approximately so when n is large. 

5.2. Array permutation. If the pairwise correlations {pTi>} of the local 
statistics are known, a Class 2 test can be constructed for the average dif- 
ference statistic Ud using its true variance, as derived in (4.1). Similarly, an 
approximate Z-test for the Wilcoxon rank sum statistic Uw can be designed 
using (4.5) if local statistics are approximately normal. However, since the 
correlations are generally unknown, a particular form of permutation can be 
used as an alternative means of approximately inducing the Class 2 null. 

In many common microarray experiments, each mRNA sample consti- 
tutes an independent unit. By permuting the column vectors of X, or equiv- 
alently the response vector y, an empirical null distribution is achieved in 
which there is no association between gene expression and the response. Ar- 
ray permutation was first used in Virtaneva et al. (2001) to test categories 
of genes, and then implemented in GSEA for a Kolmogorov-Smirnov global 
statistic [Mootha et al. (2003)], and in SAFE for any chosen global statistic 
[Barry, Nobel and Wright (2005)]. More recently, other global statistics have 
been proposed for use with array permutation, including a weighted version 
of GSEA [Subramanian et al. (2005)] and a standardized truncated mean 
that is more sensitive to directional changes [Efron and Tibshirani (2007)]. 
These reports note that array permutation does not change the correlations 
in expression among genes, and thus the gene-specific measures of DE re- 
main dependent. Also, when using array permutation the resampled local 
statistics are conditional on the observed dataset, such that their empirical 
distributions are not exactly identically distributed, and only approximately 
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follow the Class 2 null. However, if one uses the gene-specific empirical p- 
values as local statistics, then every local statistic will exactly follow the 
discrete uniform distribution under permutation, guaranteeing (5.1). 

5.3. Simulated coverage of Class 2 tests. The randomized datasets from 
(4.3) were used to evaluate Class 2 tests of each global statistic based on 
array permutation. Here the tests are ensured to be of proper size, since the 
randomization procedure in the simulation and the array permutation in the 
test both employ the same sampling process. We confirmed this by obtaining 
empirical p-values for each category and each realization of the response 
vector (Table 1). Due to the computational burdens of both simulation and 
permutation, only 1000 resamples were taken for each test, so the smallest 
possible empirical p- value was 0.001. The Class 2 Fisher's Exact Test results 
are notably conservative, due to the numerous tied global statistics that 
occur in small categories. The slight misspecification of Type I error for Up, 
Ud and Uw reflects only sampling variability. These results demonstrate 
Class 2 tests of gene categories generally outperform Class 1 tests based on 
the assumption of gene independence. 

6. A more general null for gene category tests. Although Class 2 tests of 
gene categories appropriately account for the correlation structure of gene 
expression data, they share with Class 1 procedures the shortcoming of 
assuming a null hypothesis under which local statistics are identically dis- 
tributed. This assumption is not necessary when considering whether a gene 
category exhibits a greater degree of differential expression than its comple- 
ment. For example, suppose 20% of the genes are differentially expressed to 



Table 1 

The ratio of realized Type I error rates to target a levels 





Fisher's, Uf 


Student's t, Ur> 


Wilcoxon, Uw 


Class 1 tests 








Q = 0.1 


1.19 


1.82 


1.86 


a = 0.01 


3.40 


5.92 


5.83 


a = 0.001 


13.4 


25.2 


23.5 


a = le-4 


65.6 


130 


116 


a = le-5 


367 


769 


677 


a = le-6 


2213 


4974 


4245 


Class 2 tests 








Q = 0.1 


0.39 


1.01 


1.01 


a = 0.01 


0.21 


1.01 


1.01 


q = 0.001 


0.14 


1.03 


1.01 


a = le-4 


NA 


NA 


NA 
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the same degree (where equals a common nonnull value d) , and the remain- 
ing genes have no association with the response (<5j equals the null value do). 
Any category in which 20% of the genes are differentially expressed should 
not be considered "special." However, the Class 2 null, which is induced by 
array permutation, is clearly violated under this scenario. 

Based on this simple example, we propose the following less restrictive, 
and more biologically realistic, null hypothesis of gene categories. Instead 
of requiring the local statistics of all genes to be identically distributed, we 
allow each to fall into one of K < mc strata that correspond to a different 
marginal distribution for the statistic. No conditions are imposed on the 
dependence of the local statistics. The null can be formalized as follows. 

Definition 3. Let K > 1 and let G\,...,Gk be distinct, fixed dis- 
tributions. Let the local statistics Ti,...,T m have marginal distributions 
Fx,... , F m and let C C {1, . . . , m} be a gene category. Assume that each 
Fi G {Gt,...,G K } and that c ,k = m c 1 Y.i & c I { F i = G k} and 0Q k = 
m^ 1 J^izc = Cfc} are the proportions of genes from C and C, respec- 
tively, whose local statistics are distributed as Gk- The Class 3 null hypoth- 
esis is the following: 

(6.1) H o :(3 c ,k = 0c,k = Pk, k = l,...,K, 

where the distributions G%, . . . , Gk can take any form. 

The Class 1 and Class 2 null hypotheses are then special cases of (6.1) 
with K = 1 stratum. When DE can be assessed in terms of an association 
parameter 5, and the local statistic is ^-determined [e.g., the scaled differ- 
ence in means (5.2) and the Student's t-statistic] , the strata can be directly 
related to different degrees of association with the response. In this case, 
the Class 3 null hypothesis is equivalent to the statement that the empirical 
distributions of 5s in C and G are identical. 

Dudoit et al. (2007) apply a general approach to multiple testing to a fam- 
ily of problems involving simultaneous testing of annotation-based profiles 
using gene expression data. Their work provides a framework for multiple 
testing of associations between what the authors term gene-annotation pro- 
files and gene-parameter profiles. The former include gene categories and GO 
terms. The latter are population based quantities that encompass measures 
of association between the expression of a gene and a binary or continu- 
ous response, including a scaled difference (or ratio) of means. When local 
statistics are <5-determined, the Class 2 and Class 3 nulls considered here 
can be placed within that framework, with the vector of gene-parameter 
profiles playing a role analogous to the gene association parameters 5. We 
note that the Wilcoxon global statistic is not linear in the sense described 
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in Dudoit et al. (2007), and therefore is not covered by their theoretical 
developments. 

In the following subsections we discuss simple bootstrap-based tests that: 
(a) maintain the correlation structure of the expression data and (b) demon- 
strate approximate Type I error control under different realizations of the 
Class 3 null. Simulations of microarray data reveal that tests based on boot- 
strap resampling of arrays clearly outperform array permutation tests that 
induce a Class 2 null. Further examination of the distributional properties 
of Uw and Ud indicate the poor performance of array permutation in this 
more general setting. Finally, we illustrate through simulation the increased 
power of Class 3 tests under a defined set of alternative hypotheses. 

6.1. Defining the bootstrap-based tests. Standard bootstrap methodol- 
ogy assumes that the observed data can be divided into independent units 
derived from an unknown probability model. Resampling from the empir- 
ical distribution of the observed data enables one to form approximate 
confidence intervals without parametric assumptions [Efron and Tibshirani 
(1998)]. For most microarray experiments, the independent sampling unit is 
the joint vector {x*j,yj} containing m gene expression measurements and 
response information for a single mRNA sample. To approximate the un- 
known probability model of the data, bootstrap procedures resample the 
joint vectors with replacement. Let b = (pi, . . . , b n ) be a resampling vector 
whose elements are independent and uniformly distributed over the integers 
{1, . . . ,n}. Associated with b is a resampled response y* b = (yb 1 , ■ ■ ■ ,Vb n ), 
and a resampled expression matrix in which the measurements of gene i 
are given by x* b = (x ibl 

> x ib n ) • From the resampled data, local statis- 
tics tf = T(x* 6 ,y* 6 ), and a global statistic u* b = U(tf, ...,t*£:C) may be 
calculated in the usual way. Let B denote the total number of bootstrap 
samples. 

Standard procedures can be used to generate bootstrap confidence in- 
tervals for the parameter 9 = E\U\. In the context of looking for increased 
amounts of DE, one would define a one-sided confidence interval of size 
a by its lower bound L a . Arguably, the simplest procedure for producing 
the one-sided confidence interval via bootstrap resampling is the quantile 
method [Efron (1979)], in which L a is the sample a-quantile of the resam- 
pled values: u*^. a y The quantile method is straightforward to compute and 
invariant under monotone transformations of the global statistics. However, 
its coverage may be poor when the sample size is small [Efron (1987)], due 
to the difficulty of estimating the tail distribution of the global statistic. 
Alternatively, if one assumes that the global statistic is approximately nor- 
mal, a confidence interval can be generated from the ^distribution using 
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bootstrap-based estimates of the moments of U\y [Efron (1979)]. The re- 
sulting one-sided confidence interval has a lower bound given by 

(6.2) u*-se*{U)- tn-n-a, 
where 

1 B 

(6.3) u* = — J2 u b and se*(U) = 

6=1 

Hypothesis testing with the bootstrap was originally proposed by Efron 
as being applicable when the distribution of a test statistic was unknown un- 
der a null hypothesis (due to nuisance parameters) but could be induced by 
bootstrap-resampling [Efron and Tibshirani (1998)]. Hall and Wilson (1991) 
have proposed that bootstrap-based tests can also be constructed when the 
empirical distributions of resampled statistics do not directly relate to Hq if 
a "pivot test" is used. In the setting of gene categories, this would relate to 
testing for the exclusion of some null value, Oq = Eh [U], from a confidence 
interval defined by the populations of resampled global statistics. These tests 
are of particular use for the Class 3 null, which would be difficult to induce 
directly through the resampling the arrays in a way that also maintains 
gene-correlation. 

We generally favor using the Wilcoxon rank sum global statistic, Uyy, f° r 
gene category tests, because it is a robust and transformation-invariant mea- 
sure of average difference that avoids the arbitrariness the gene-list methods 
have of choosing a rejection region. The following theorem establishes the 
expectation of the Wilcoxon global statistic Uw is a constant, 9q, under all 
realizations that the stratified null hypothesis (6.1). 

Theorem 1. Suppose that for each i the local statistic Tj of gene i has 
distribution Fi £ {G\, . . . , Gk} and that P{Ti = Tj) = for i ^ j. Then for 
any category C C {1, . . . , to} such that @c,k = &a k = fa f or eac ^ stratum 
k = l,...,K, 

(6-4) E[U w ] = mC -^ + 1) . 

(See Appendix A for the proof.) Note that the expectation is constant, 
regardless of the number of strata K, the proportions {/3i, . . . , /3k }, and the 
distributions {G7i, . . . , Gk}- Similar derivations demonstrate that the global 
statistics Ud and Up have a fixed expectation of under (6.1), allowing for 
the construction of a test based on bootstrap resampling. 

The Class-3 bootstrap tests advocated here are based on standard boot- 
strap confidence intervals. Dudoit et al. (2007) also propose bootstrap tests 
within the multiple testing framework considered in their paper. Specialized 
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to the setting of the Class 3 null, their bootstrap procedure is essentially 
similar to the quantile-based method we discuss above. Dodd et al. (2006) 
have also applied bootstrap resampling to category testing in the context 
of differential expression between cancerous and normal human tissue, in 
which a genelist was interrogated with the Pearson-type global statistic, 
Up. Many of the arguments we present in the following sections in support 
of bootstrap testing can be applied to the procedures in Dudoit et al. (2007) 
and Dodd et al. (2006). 

In contrast to the global statistics mentioned above, the expectation of 
the global statistic employed in Fisher's Exact test depends on the K gene- 
specific distributions, and the expectation of the Kolmogorov-Smirnov type 
global statistic used in Mootha et al. (2003) depends on both the marginal 
distribution of the local statistics, and their correlation structure. As a conse- 
quence, one cannot test for exclusion of a null value from confidence intervals 
of these two global statistics. 

The following simulations of various Class 3 nulls were designed to eval- 
uate the Type I and II errors of bootstrap-based tests for both continuous 
global statistics, Uw and Ud- These results demonstrate that array permu- 
tation is clearly inferior in this setting. 

6.2. Type I error under a simulated Class 3 null. The randomized lung 
cancer dataset described in Section 4.3 is extended to evaluate the Type I 
error incurred by permutation- and bootstrap-based tests of Uw and Ud 
under (6.1). A Student's t is used as the local statistic for DE across the 
simulated conditions; under the assumption the gene expression values ap- 
proximately follow normality, each distributional strata is determined by the 
association parameter d given in (5.2). We investigated several null hypothe- 
ses with K = 2 strata of genes relating to no DE (<5j = 0) and positive DE 
(5i = d > 0). To artificially generate different degrees of DE in a particular 
gene, the expression values were first standardized to have variance 1; then 
d ■ \J\jn\ + l/ri2 was added to those values x%j with condition yj = 1. Sim- 
ulations were run using three different levels of DE, d = 1, 3 and 5, and also 
for three proportions of DE, (3 = 1/5,1/3 and 1/2. For each proportion, we 
selected a subset of nonoverlapping categories with the property that (5 ■ mc 
is an integer. This resulted in 41 categories being considered for (5 = 1/5, 40 
categories for = 1/3 and 34 categories for (3 = 1/2. The selected categories 
exhibited a wide range of correlation in expression, reflective of that seen 
across the entire dataset. 

For each of 10000 randomizations of tumor status, the permutation- and 
quantile bootstrap-based hypothesis tests were conducted using 2000 per- 
mutations and resamples, respectively. Simulations indicate that B = 200 
resamples are typically sufficient for the moment estimates in the bootstrap 
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Fig. 3. Performance of gene category tests using bootstrap- quantile (green), bootstrap-t 
(red), and array permutation (blue) methods, for the continuous global statistics Uw and 
Ud, and various Class 3 null hypotheses. The range of Type I errors of different categories 
is shown for (a) four proportions of DE, f3, and (b) for four levels of DE; (c) the aver- 
age Type I errors of Uw and Ud for different a levels under a Class 3 null with d — 3 
and (3 = 1/3; (d) the Type I error for a = 0.05 is plotted for each category against their 
estimated Class 1 variance inflation, 1 + (mc — 1) ■ pc, as derived from (4-1). 



22 



W. T. BARRY, A. B. NOBEL AND F. A. WRIGHT 



t-intervals, and were used accordingly. Type I error was determined by com- 
paring the empirically derived p- values to various a levels (Figure 3). For a 
target a = 0.05, the bootstrap Type I error was only slightly inflated, and re- 
mained relatively unchanged regardless of /3 and d, whereas the Type I error 
of permutation testing dropped dramatically as either parameter diverged 
from 0. For d = 3 and (3 = 1/3, the minimum empirical p- values obtained un- 
der permutation were 0.0055 and 0.001 for Uw and Ud respectively, which 
are several orders of magnitude higher than what would be expected with 
proper Type I error control. 

These findings illustrate the Class 2 tests based on array permutation are 
overly conservative under the broader null. In order to better understand 
the conservative behavior of array permutation, we note that it induces 
a special case of (6.1) under which the local statistics are approximately 
identically distributed (5.1). We return to the variance of the Wilcoxon 
global statistic derived in (4.5), and define the following type of positively 
correlated category. 

Definition 4. For local statistics T\,...,T m with correlations {pfj/}, 
a category C C {1, . . . ,m} will be called correlation dominant if for every 
{i, i'} G C and {h, h'} C it is true that pf ir > pf h and pf h < pj^ h ,, in other 
words, correlations within the category and its complement are greater than 
those between the category and its complement. 

In the following theorem we establish that, for normally distributed local 
statistics, the variance of the Wilcoxon global statistic Uw is maximized 
under the K = 1 null given in (5.1) for all correlation dominant categories. 

Theorem 2. Let T±, . . . ,T m be random variables that follow a multi- 
variate normal distribution with means 5±, . . . , 5 m , unit variances and corre- 
lations {pJii}- For a correlation dominant gene category C, the variance of 
Uw has a global maximum at 8\ = 62 = ■ ■ ■ = 8 m = d. 

Because array permutation induces the special case of (6.1) where the vari- 
ance of Uw is maximum, it is reasonable that the tests will tend to become 
conservative under Class 3 null hypothesis that depart from (5.1). Although 
the complex structure of gene correlation would likely prevent any real cat- 
egory from meeting the strict criterion of being correlation dominant, the 
conservativeness of array permutation tests is seen across all categories in 
the randomized datasets (Figure 3). We have also confirmed these results in 
two-condition datasets simulated from a multivariate Gaussian model (not 
shown). The equal conservativeness of Ud remains to be explored in full de- 
tail, but we suggest that the pooled-variance estimate used in standardizing 
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the global statistics similarly overestimates the true variance in the Class 3 
null hypotheses with more than 1 stratum of DE. 

While in the simulations presented above both bootstrap methods main- 
tained their approximately correct Type I error throughout, we have found 
that in simulated expression data with a smaller sample size of n = 20 arrays, 
the quantile-based method becomes more anti-conservative at small target a, 
since many microarray datasets can be of this size, and the quantile-interval 
also requires more resamples to conduct tests with small a. Further, as Uw 
is the sum of mp • ( m — mc) pairwise comparisons of local statistics, approx- 
imate normality follows from the Central Limit Theorem when the average 
correlation between these terms is not extreme and mc is large. Histograms 
of resampled global statistics confirm that the approximation to the normal 
distribution is appropriate for the large number of genes in a typical mi- 
croarray experiment. Therefore, we prefer the bootstrap Student's t-interval 
for more general use in Class 3 tests of gene categories. 

6.3. Power under simulated alternatives. To assess the relative power of 
the bootstrap tests over array permutation, alternative hypotheses must be 
specified that relate to increased amounts of DE in a gene category. When the 
differential expression of a gene can be measured in terms of an association 
parameter 5, an average increase of DE within the category relative to its 
complement can be written as 



For these alternatives, the Wilcoxon rank sum Uw is well suited to identi- 
fying increased amounts of DE in a robust manner. 

In the randomized lung carcinoma dataset, realizations of (6.5) can be 
achieved by applying an additive or multiplicative constant to all gene- 
specific parameters within the category. More precisely, if {5® : i 6 C} are 
the association parameters of a category under the Class 3 null, we consider 
H A to be either of the form {Sf = c+8® : i 6 C} or {Sf = c- 5° : i E C}. In this 
way, power curves can be displayed across a single axis by varying c. Figure 
4 illustrates the effects when the constant is applied in an additive manner 
for K = 2 strata of DE and nonDE genes, and in a multiplicative manner 
for an example with K = 5 strata. The results demonstrate considerable 
improvements in power of the bootstrap methods over array permutation. 

7. Analysis of a survival microarray dataset. The breast cancer survival 
dataset from Chang et al. (2005) are used to illustrate the utility of boot- 
strap resampling as compared to array permutation. A total of n = 295 
breast cancer samples were analyzed on Agilent microarrays, and normal- 
ized gene expression estimates were obtained for a subset of m = 11176 genes 
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Fig. 4. Average power of permutation- and bootstrap-based gene category tests as one 
departs from Class 3 nulls. Results based on randomized microarray data and real GO 
categories, and applying (a) an additive constant to K — 2 classes of genes with 1/3 
differentially expressed at d — 1 (as shown by the CDF in the inset graphs), and (b) a 
multiplicative constant to K = 5 classes of genes with {dk} equally spaced between and 
1. Both scenarios exhibit more power to detect the alternative using the bootstrap tests. 



that are annotated to at least one of 1348 GO terms (details on normaliza- 
tion, filtering, and formation of gene categories are omitted, but available 
from the authors) . Survival times and censoring indicators were available for 
each array. Wald statistics from univariate Cox proportional hazard models 
were used as local statistics to reflect the association between expression and 
patient outcome. 

We employed the Wilcoxon rank-sum Uyy as a the global statistic for 
the permutation and bootstrap-based tests; results were obtained from 1000 
permutations/resamples of the data, respectively. The p- values produced by 



Table 2 

Number of significant GO categories for target a 
levels 







Perm 


Boot-q 


Boot-t 





= 0.1 


195 


222 


220 


a 


= 0.05 


129 


157 


160 


a 


= 0.01 


56 


72 


85 


a 


= 0.005 


36 


63 


7:5 





= 0.001 


12 


40 


48 





= 3.7e-5* 


NA 


NA 


28 



*Bonferroni cutoff. 
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the bootstrap quantile and i-intervals were in good agreement across the 
set of categories (rank correlation > 0.999), reflecting the fact that the dis- 
tributions of resampled global statistics were nearly normally distributed. 
The permutation test also showed good agreement with the bootstrap (rank 
correlation of 0.977 with bootstrap results), but a distinct difference was 
observed in the number of categories achieving various levels of significance 
(Table 2). The improved power of the bootstrap methods is evidenced by 
the increased number of significant categories, with 48 declared significant 
via bootstrapping at a = 0.001, but only 12 declared significant via per- 
mutation. The minimal possible p-value of the permutation and bootstrap- 
quantile tests are limited by the 1000 resamples that were taken of the data. 
The bootstrap i-interval does not have this restriction, and 28 categories 
were observed to pass the conservative Bonferroni threshold for a = 0.05. 
Because of the iterative procedure for estimates from the Cox-proportional 
hazard model, taking additional resamples of the dataset was computation- 
ally infeasible, and would be prohibitive when trying to control the FWER 
across such a large number of categories. 

8. Discussion. We have used the terminology of local and global statis- 
tics as presented in SAFE [Barry, Nobel and Wright (2005)] to describe ex- 
isting methods for testing differential expression within a gene category. By 
classifying methods according to their assumed null hypotheses, we illustrate 
a number of shortcomings of these methods. We propose a novel bootstrap- 
based approach that uniquely allows both for genes within a category and 
its complement to be correlated, and that maintains proper error control 
under a more biologically sensible null hypothesis than has been implicitly 
used by other methods. 

As a last but very important advantage to the bootstrap-based procedure, 
we note that by resampling with replacement, the bootstrap can incorpo- 
rate covariate information in a sensible manner. In permutation testing, by 
inducing a null that breaks the association between the response and ex- 
pression, the covariate information can no longer be linked to both. Thus, a 
researcher is forced to choose the part of the data to remain linked to the co- 
variate. By resampling the data jointly, the bootstrap allows the relationship 
between all three variable types to be maintained. The proper consideration 
of covariates is just one area of potential improvement, as gene category 
testing moves toward greater statistical maturity. 

APPENDIX A: PROOF TO THEOREM 1 

The following elementary lemma is useful in evaluating the expectation 
of U\v- 
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Lemma A.l. Let T\ and T 2 be distributed as G\ and G 2 and assume 
that P(Ti = T 2 ) = 0. Define n(G 1 ,G 2 ) = £[/{Ti > T 2 }], then p(G x ,G 2 ) = 
l-/x(G 2 ,Gi) and /i(G 1 ,G 2 ) = 1/2 when G X = G 2 . 

The expectation of Uw is calculated by decomposing the mc ■ mc pairwise 
comparison of Ts into the K 2 different terms involving p(G k , G k ')'- 
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^ Rank(Tj' 
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+ EE E E g*) 
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mc • (mc + 1) mc ■ mc mc ■ (m + 1) 
~~ 2 h 2 ~~ 2 ' 

such that is invariant to the number of strata K, their proportionate 

sizes . . . , and local statistic distributions {Gi, . . . Gk}- 

APPENDIX B: PROOF TO THEOREM 2 

The following lemma regarding the bivariate normal distribution is useful 
for establishing an inequality for Var[£/j^]. 



Lemma B.2. For the bivariate normal distribution, the following is true 
for the function f(x, y) = $> 2 (x, V\ p) — $(x) ■ &(y)' 

1. /(0,0) is a global maximum when p > 0, 

2. /(0, 0) is a global minimum when p < 0, 
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3. f(x, y) = when p = 0. 



Proof. The first derivatives of f(x,y) are 



dx 



x,y) 



d_ 

dx 



y- px 



and ^ has an analogous form due to symmetry. Since $ is a strictly in- 
creasing function, setting the derivatives equal to zero leads to the following 
equations: 



y- px 



x- py 



i-p 2 • y, 



1 — p 2 ■ x, 



for which {x = 0,y = 0} is the only solution when p ^ 0. Since (0,0) is the 
only stationary point, a second derivative test can be used to determine 
whether it is a global minimum or maximum [Thomas and Finney (1992)]. 
The second derivatives are solved to be 

d 2 f, 



dx 2 



dy 2 



d 2 f 
dx dy 



{x,y) =4>'{x) 



y- px 

y- px 
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dy dx 



(x,y). 



At the point {x = 0, y = 0} the derivatives are equal to 

~P 



d 2 f d 2 f 

J (0,0) = -4(0,0)=^0) 2 



(B.l) 



dy 2 
d 2 f 



(0,0) 



dx 2 

d 2 f 



dx dy dy dx 

and the discriminant takes the form 



D(0,0) = S(0,0)-g(0,0) 



(0,0) =0(0) • 
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(B.2) 



m 2 

0(O) 4 



-P 



0(0) 



p 2 (1 _ f 



0(0) 
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1-p 2 



0(O) 4 -2- 
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Since \/l — p 2 > (1 — p 2 ) for all nonzero p € (—1, 1), (B.2) is strictly posi- 
tive, proving that either a minimum or a maximum must exist. Prom the 
second derivatives in (B.l), one can show that /(0,0) is a minimum when 
p < and a maximum when p > 0. Last, f(x,y) is exactly when p = by 
independence. □ 

The variance of J7vt/ can be decomposed using the Mann-Whitney form 
of the statistic: 



Var[[7, 



in 



(B.3) 



Var 



Var 



^Rank^) 
m c ■ (m c + 1) 



= E E E E Cov [^ > T h },m> > ivy], 

where 

CovI/^T/J,/^^}] 

= > T h } ■ I{T V > T h ,}] - E[I{T t > T h }] ■ E[I{T V > T h ,}} 

= P({T h -Ti<0}n {T^ - T v < 0}) - P(T h — Tj < 0) 
x P(T h i - Tii < 0). 

Under (6.1) and the Gaussian assumption, the paired differences in lo- 
cal statistics follow noncentral bivariate normal distributions with marginal 
means 5^ — 5i and 5h> — 5^. Each covariance term can be written as 



(B.4) 



®2(S h -8i,6 h >-6i>;p)-&(6 h -6i)-§(6 h '-6v), 



where $ and <3?2 represent the CDFs of a univariate and bivariate normal 
distributions with unit variance, and p is defined as 



(B.5) 
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We consider in turn the several forms of p that arise in (B.4). 

When i = i' and h = h', p is proportional to 2 — 2 • pj h , which is a positive 
quantity except when the genes are perfectly correlated which is ruled out 
by the definition of a correlation dominant category. From Lemma B.2, (B.3) 
is maximized when Si = Sh- Since this is true for all {i, h} pairs of category 
and complement genes, a global maximum of the summed covariances will 
occur when all local statistics have the same mean. 

When i = i' and h^h' ', p is proportional to 1 + pj v — pf, h — pf h , and will 
be greater than for a correlation dominant category such that a maximum 
occurs when 5i = 5h = Sh' . An analogous argument holds for when i ^ i' and 
h = ti. 

For i ^ i' and h^h', either p will be positive if (pj v + p hh ,) > {pj, ' h + pj h ,) 
so that (B.4) is maximized when 5h = Si and Sy = Sy, or p will be exactly 
^ (pfi' + Phh') = (pJ' ' h + Pfh') an< ^ 0^-^) wm be constant. This inequality of 
summed correlations is again guaranteed for correlation dominant categories. 

This proves a global maximum for Var[C/iy] is achieved at 5\ = 82 = ■ ■ ■ = 
S m = d since only in this case will every covariance term in (B.3) be either 
maximized, or a constant. 

SUPPLEMENTARY MATERIAL 

Supplement A: Measures of differential expression in gene category test- 
ing (Figure) (doi: 10.1214/07-AOAS146SUPPA; .pdf). In gene category test- 
ing, global statistics typically fall into two groups: "categorical" statistics 
that rely on a list of significant genes to be identified, and "continuous" 
statistics that incorporate real-valued measures of gene-specific differential 
expression. The following figure illustrates the two data types using the no- 
tation framework described in the article. 

Supplement B: Variance of the Wilcoxon rank sum statistic under cor- 
relation (Theorem) (doi: 10.1214/07-AOAS146SUPPB; .pdf). The variance 
of the Wilcoxon rank sum global statistic in equation (4.5) is derived in the 
following theorem under the assumption of dependent and identically dis- 
tributed Gaussian local statistics. The proof is presented using the notation 
framework described in the article, and is analogous to the classic work of 
Gastwirth and Rubin. 
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